About the project



Regression and model validation

learning2014=read.table("data/learning2014.txt", sep = ",", header = TRUE)

str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014)
## [1] 166   7

The dataset contains 166 observation units and 7 variables.


alc=read.table("data/alc.txt", sep = ",", header = TRUE)

head(alc)
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher
## 2     GP   F  17       U     GT3       T    1    1  at_home    other
## 3     GP   F  15       U     LE3       T    1    1  at_home    other
## 4     GP   F  15       U     GT3       T    4    2   health services
## 5     GP   F  16       U     GT3       T    3    3    other    other
## 6     GP   M  16       U     LE3       T    4    3 services    other
##       reason nursery internet guardian traveltime studytime failures
## 1     course     yes       no   mother          2         2        0
## 2     course      no      yes   father          1         2        0
## 3      other     yes      yes   mother          1         2        2
## 4       home     yes      yes   mother          1         3        0
## 5       home     yes       no   father          1         2        0
## 6 reputation     yes      yes   mother          1         2        0
##   schoolsup famsup paid activities higher romantic famrel freetime goout
## 1       yes     no   no         no    yes       no      4        3     4
## 2        no    yes   no         no    yes       no      5        3     3
## 3       yes     no  yes         no    yes       no      4        3     2
## 4        no    yes  yes        yes    yes      yes      3        2     2
## 5        no    yes  yes         no    yes       no      4        3     2
## 6        no    yes  yes        yes    yes       no      5        4     2
##   Dalc Walc health absences G1 G2 G3 alc_use high_use
## 1    1    1      3        5  2  8  8     1.0    FALSE
## 2    1    1      3        3  7  8  8     1.0    FALSE
## 3    2    3      3        8 10 10 11     2.5     TRUE
## 4    1    1      5        1 14 14 14     1.0    FALSE
## 5    1    2      5        2  8 12 12     1.5    FALSE
## 6    1    2      5        8 14 14 14     1.5    FALSE
attach(alc)

Variables in the dataset include, for example, weekday alcohol use, weekend alcohol use, health, absences, total weekly alcohol use, high alcohol use, study time, guardian, family size, address, age, sex and school. These are only some of the variables; the rest can be seen in the output above.

Next I examine how high alcohol use, which is a logical variable, relates to health, absences, study time and sex. My hypothesis is that health and high alcohol use correlate negatively. I also expect a negative correlation between study time and high alcohol use, and a positive correlation between absences and high alcohol use. Finally, I hypothesize that sex affects high alcohol use so that men more often have the value TRUE for the high_use variable, i.e. there are more heavy users among men.

table(health,high_use)
##       high_use
## health FALSE TRUE
##      1    35   11
##      2    28   16
##      3    62   19
##      4    49   18
##      5    94   50
table(absences,high_use)
##         high_use
## absences FALSE TRUE
##       0     52   13
##       1     38   13
##       2     42   16
##       3     33    8
##       4     24   12
##       5     16    6
##       6     16    5
##       7      9    3
##       8     14    6
##       9      6    6
##       10     5    2
##       11     2    4
##       12     4    4
##       13     1    1
##       14     1    6
##       16     0    1
##       17     0    1
##       18     1    1
##       19     0    1
##       20     2    0
##       21     1    1
##       26     0    1
##       27     0    1
##       29     0    1
##       44     0    1
##       45     1    0
table(studytime,high_use)
##          high_use
## studytime FALSE TRUE
##         1    58   42
##         2   135   60
##         3    52    8
##         4    23    4
table(sex,high_use)
##    high_use
## sex FALSE TRUE
##   F   156   42
##   M   112   72
# grouped bar plots of the counts in the cross tabulations above
barplot(table(high_use, health), beside = TRUE, legend.text = TRUE, xlab = "health")

barplot(table(high_use, absences), beside = TRUE, legend.text = TRUE, xlab = "absences")

barplot(table(high_use, studytime), beside = TRUE, legend.text = TRUE, xlab = "studytime")

# box plots of the distributions by high alcohol use
boxplot(health ~ high_use, data = alc, xlab = "high_use", ylab = "health")

boxplot(absences ~ high_use, data = alc, xlab = "high_use", ylab = "absences")

boxplot(studytime ~ high_use, data = alc, xlab = "high_use", ylab = "studytime")

# sex is a factor, so a bar plot of the cross tabulation is used instead of a box plot
barplot(table(high_use, sex), beside = TRUE, legend.text = TRUE, xlab = "sex")

The box plots show that my hypotheses about the relationships of high alcohol use with health and with study time were correct. For those whose alcohol use is not high, the median and the lower and upper quartiles of health are clearly higher than the corresponding statistics for those whose alcohol use is high. Likewise, the median and quartiles of study time are higher for those whose alcohol use is not high. Growing alcohol use is thus negatively associated with health and with study time. For absences the association goes the other way: the cross tabulation shows that heavy users tend to have relatively more absences (for example, the median of absences is higher in the high-use group), which is in line with my hypothesis of a positive association. The cross tabulation of sex shows that there are more men than women whose alcohol use is high, and fewer men than women whose alcohol use is not high. The share of heavy users must therefore be relatively larger among men than among women, so my hypothesis about sex and high alcohol use was correct.
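The relative shares behind this reasoning can be checked directly. A small sketch, using the counts printed in the cross tabulation of sex above (with the actual data the same table comes from `prop.table(table(sex, high_use), margin = 1)`):

```r
# Row-wise proportions of high alcohol use by sex,
# computed from the counts in the printed cross tabulation
counts <- matrix(c(156, 112, 42, 72), nrow = 2,
                 dimnames = list(sex = c("F", "M"), high_use = c(FALSE, TRUE)))
prop.table(counts, margin = 1)
# F: about 0.21 of women are heavy users; M: about 0.39 of men
```

About 21% of the women and 39% of the men are heavy users, which confirms the relative difference read off the raw counts.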

m = glm(high_use ~ health + absences + studytime + sex, data = alc, family = "binomial")

summary(m)
## 
## Call:
## glm(formula = high_use ~ health + absences + studytime + sex, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2074  -0.8367  -0.5983   1.0910   2.1407  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.08010    0.52055  -2.075 0.037995 *  
## health       0.04493    0.08631   0.521 0.602675    
## absences     0.08881    0.02307   3.850 0.000118 ***
## studytime   -0.39817    0.15960  -2.495 0.012603 *  
## sexM         0.78234    0.25261   3.097 0.001954 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 423.00  on 377  degrees of freedom
## AIC: 433
## 
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept)      health    absences   studytime        sexM 
## -1.08009806  0.04492699  0.08881324 -0.39817186  0.78234486

From the output of the estimated logistic regression model we see that when health increases by one unit, the log-odds of high alcohol use, i.e. log(p/(1-p)) where p is the probability of high alcohol use, increases by about 0.04. Similarly, when absences increase by one, log(p/(1-p)) increases by about 0.09. The coefficient of study time is interpreted in the same way: a one-unit increase decreases the log-odds by about 0.40. Sex is a categorical variable, and its coefficient is interpreted differently from the others. R has chosen females as the reference group, so the intercept corresponds to the reference group and the coefficient of sexM describes the difference between males and the reference group. The log-odds level for males is therefore the intercept plus the male coefficient, i.e. about -1.08 + 0.78 = -0.30. The coefficient estimates of absences, study time, sexM and the intercept are statistically significant. The significance of the sexM estimate means that the hypothesis that the difference between males and the reference group is zero can be rejected at the one percent significance level. The coefficient estimate of health is not statistically significant, so the hypothesis that the coefficient of health is zero cannot be rejected at conventional significance levels.
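The male level can be verified from the printed estimates (hard-coded here for illustration):

```r
# Log-odds of high use for a male with all other predictors at zero:
# the intercept plus the sexM coefficient
intercept <- -1.08010
sexM      <-  0.78234
intercept + sexM                       # about -0.30
# ... and the corresponding probability via the inverse logit
1 / (1 + exp(-(intercept + sexM)))     # about 0.43
```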

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
OR =coef(m) %>% exp

CI=exp(confint(m))
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR     2.5 %    97.5 %
## (Intercept) 0.3395622 0.1203826 0.9317189
## health      1.0459515 0.8842249 1.2412355
## absences    1.0928765 1.0468090 1.1460148
## studytime   0.6715466 0.4866129 0.9113044
## sexM        2.1865935 1.3377030 3.6085532

The odds ratio of health is the ratio of the odds of high alcohol use when the health variable increases by one unit to the odds when it does not. Because it is greater than one, a one-unit increase in health is positively associated with high alcohol use, which differs from my hypothesis. The odds ratio of absences is also above one, so a one-unit increase in absences is positively associated with high alcohol use, which is in line with my hypothesis. The odds ratio of study time is below one, meaning that a one-unit increase in study time is negatively associated with high alcohol use, as I expected. Being male also appears to be positively associated with high alcohol use, since the odds ratio of sexM is above two; this too matches my hypothesis.
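The odds ratios above are simply the exponentials of the coefficient estimates; for example (values hard-coded from the output above):

```r
b_absences <- 0.08881324
b_sexM     <- 0.78234486
exp(b_absences)   # about 1.093, matching the OR column above
exp(b_sexM)       # about 2.187
```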

# Fit the model without the health variable, whose coefficient estimate was not statistically significant.
m = glm(high_use ~ absences + studytime + sex, data = alc, family = "binomial")

# Predict the probability of high alcohol use.
probabilities = predict(m, type = "response")

# Add the predicted probabilities to the dataset.
alc = mutate(alc, probability = probabilities)

# Use the probabilities to predict high alcohol use.
alc = mutate(alc, prediction = probabilities>0.5)

# Cross tabulate the high alcohol use variable with the predictions.
table(high_use = alc$high_use, prediction = probabilities>0.5)
##         prediction
## high_use FALSE TRUE
##    FALSE   258   10
##    TRUE     88   26

The cross tabulation shows that the values of the high alcohol use variable deviate to some extent from the predictions made with the logit model. In 10 cases the prediction is high alcohol use when the observed value corresponds to not-high use, and in 88 cases the prediction is not-high use when the observed value corresponds to high alcohol use.

# Load the dplyr and ggplot2 packages.
library(dplyr); library(ggplot2)

# Draw a scatter plot of high alcohol use versus its predicted probabilities.
g = ggplot(alc, aes(x = probability, y = high_use, col=prediction))

# Define the geom as points and draw the plot.
g+geom_point()

# Tabulate high alcohol use against the predictions.
table(high_use = alc$high_use, prediction = alc$prediction)%>%prop.table()%>%addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67539267 0.02617801 0.70157068
##    TRUE  0.23036649 0.06806283 0.29842932
##    Sum   0.90575916 0.09424084 1.00000000
# Define the loss function, i.e. the mean prediction error.
loss_func = function(class, prob) {
  n_wrong = abs(class - prob) > 0.5
  mean(n_wrong)
}

# Call the loss function to compute the share of wrong predictions in the data.
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2565445

About 26 percent of the individuals are classified incorrectly. From the table above we see that according to the predictions about 91 percent of the individuals do not have high alcohol use, while according to the observed values the share is about 70 percent. Again, the predictions differ from the observed values.
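The loss function result can be reproduced by hand from the cross tabulation printed above:

```r
# Counts from the printed confusion matrix
confusion <- matrix(c(258, 88, 10, 26), nrow = 2,
                    dimnames = list(high_use = c("FALSE", "TRUE"),
                                    prediction = c("FALSE", "TRUE")))
wrong <- confusion["FALSE", "TRUE"] + confusion["TRUE", "FALSE"]  # 10 + 88
wrong / sum(confusion)   # about 0.2565, the training error reported above
```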

***

Clustering and classification

# access the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# load the data
data("Boston")

# explore the dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

We loaded the Boston data from the MASS package and explored its structure and dimensions. The output of the R command `dim` shows that the data contains 506 observations of 14 variables. The output of the R command `str` shows that all variables are numerical; some take integer values. According to the documentation page of the Boston dataset, the variables are the following:

crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: nitrogen oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted mean of distances to five Boston employment centres.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per \$10,000.
ptratio: pupil-teacher ratio by town.
black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat: lower status of the population (percent).
medv: median value of owner-occupied homes in \$1000s.
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
library(ggplot2)

library(GGally)
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
pairs(Boston[-1])

p=ggpairs(Boston, mapping = aes(alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

p

# calculate the correlation matrix
cor_matrix<-cor(Boston) 

# print the correlation matrix
cor_matrix
##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593
##            ptratio       black      lstat       medv
## crim     0.2899456 -0.38506394  0.4556215 -0.3883046
## zn      -0.3916785  0.17552032 -0.4129946  0.3604453
## indus    0.3832476 -0.35697654  0.6037997 -0.4837252
## chas    -0.1215152  0.04878848 -0.0539293  0.1752602
## nox      0.1889327 -0.38005064  0.5908789 -0.4273208
## rm      -0.3555015  0.12806864 -0.6138083  0.6953599
## age      0.2615150 -0.27353398  0.6023385 -0.3769546
## dis     -0.2324705  0.29151167 -0.4969958  0.2499287
## rad      0.4647412 -0.44441282  0.4886763 -0.3816262
## tax      0.4608530 -0.44180801  0.5439934 -0.4685359
## ptratio  1.0000000 -0.17738330  0.3740443 -0.5077867
## black   -0.1773833  1.00000000 -0.3660869  0.3334608
## lstat    0.3740443 -0.36608690  1.0000000 -0.7376627
## medv    -0.5077867  0.33346082 -0.7376627  1.0000000

We showed a graphical overview of the data and summaries of the variables. From the summaries we can see, for example, that the minimum of the town crime rate is about 0.006, the median about 0.257, the mean about 3.614 and the maximum about 88.976. The plot shows that the pairwise distributions of the variables are rarely normal; only the distribution of the average number of rooms per dwelling looks approximately normal. From the correlation matrix we see that, for example, the crime rate correlates strongly with the index of accessibility to radial highways (0.63) and with the full-value property-tax rate per \$10,000 (0.58).
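Picking the strongest pairwise correlations out of a 14-by-14 matrix by eye is error-prone; a small sketch of doing it programmatically on the Boston data loaded above:

```r
library(MASS)

cm <- cor(Boston)
cm[upper.tri(cm, diag = TRUE)] <- NA     # keep each variable pair only once
ord <- order(abs(cm), decreasing = TRUE, na.last = NA)
strongest <- data.frame(var1 = rownames(cm)[row(cm)[ord]],
                        var2 = colnames(cm)[col(cm)[ord]],
                        cor  = cm[ord])
head(strongest, 3)   # rad-tax (0.91), dis-nox (-0.77) and nox-indus (0.76)
```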

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
# change the object to data frame
boston_scaled=as.data.frame(boston_scaled)

We scaled the variables in the Boston dataset. The summary of the scaled variables shows that, for example, the statistics of the crime rate changed radically: the minimum changed from about 0.006 to about -0.419, the median from about 0.257 to about -0.390, the mean from about 3.614 to 0 and the maximum from about 88.976 to about 9.924.
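`scale()` centers each column with its mean and divides by its standard deviation; a minimal check on a toy vector:

```r
x <- c(1, 2, 3, 10)
manual <- (x - mean(x)) / sd(x)          # the same formula scale() applies per column
all.equal(as.numeric(scale(x)), manual)  # TRUE
```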

# save the scaled crim as scaled_crim
scaled_crim <- boston_scaled$crim

# create a quantile vector of crim and print it
bins <- quantile(scaled_crim)

# create a categorical variable 'crime' with the quantiles as break points
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, labels = c("low","med_low","med_high","high"))

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

We created a categorical crime rate variable from the scaled crime rate in the Boston dataset, using the quantiles as break points, and dropped the original crime rate variable from the dataset. We also divided the dataset into train and test sets, with 80% of the data in the train set.
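Note that `sample()` above draws different rows on every run; a reproducible variant of the same 80/20 split, sketched on toy data (the seed value is arbitrary):

```r
set.seed(2017)                       # fix the random number generator
d <- data.frame(x = rnorm(100))      # stand-in for boston_scaled
n <- nrow(d)
ind <- sample(n, size = floor(n * 0.8))
train <- d[ind, , drop = FALSE]
test  <- d[-ind, , drop = FALSE]
c(nrow(train), nrow(test))           # 80 20
```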

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2623762 0.2500000 0.2500000 0.2376238 
## 
## Group means:
##                   zn      indus          chas        nox         rm
## low       0.90707575 -0.8973747 -0.1237592475 -0.8703857  0.4766513
## med_low  -0.07905834 -0.3019858  0.0005392655 -0.5719843 -0.1094363
## med_high -0.35818425  0.1474470  0.2734075986  0.4106241  0.2196304
## high     -0.48724019  1.0172418 -0.0262603030  1.0565447 -0.4503831
##                 age        dis        rad        tax     ptratio
## low      -0.9196188  0.8066139 -0.6774191 -0.7181744 -0.43482972
## med_low  -0.2895022  0.3621515 -0.5565973 -0.4961714 -0.08098845
## med_high  0.4161521 -0.4022303 -0.3962667 -0.3107082 -0.36590632
## high      0.7933304 -0.8460220  1.6368728  1.5131579  0.77931510
##               black       lstat         medv
## low       0.3817438 -0.79445499  0.546236881
## med_low   0.3174582 -0.11323356  0.002892173
## med_high  0.1320921 -0.09265809  0.272025533
## high     -0.7748230  0.86354448 -0.691510023
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.09294862  0.60936151 -0.88167027
## indus   -0.00621148 -0.25168195  0.33188574
## chas    -0.10083793 -0.04503492  0.08033914
## nox      0.41537214 -0.78609767 -1.29999505
## rm      -0.11003205 -0.08038620 -0.19793863
## age      0.28364238 -0.39905517  0.08431566
## dis     -0.08409844 -0.24632495  0.34272184
## rad      3.11688718  0.86056118 -0.09143322
## tax      0.06462302  0.06941172  0.36565249
## ptratio  0.11612059  0.04517767 -0.21024686
## black   -0.13085551  0.02170349  0.09462636
## lstat    0.24532588 -0.13844215  0.63963292
## medv     0.22569652 -0.41695274  0.02044200
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9467 0.0390 0.0142
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

We fitted a linear discriminant analysis (LDA) on the train set, using the categorical crime rate as the target variable and all other variables in the dataset as predictors, and drew the LDA (bi)plot.

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       12       9        0    0
##   med_low    7      13        5    0
##   med_high   0      13       11    1
##   high       0       0        0   31

We saved the crime categories from the test set and then removed the categorical crime variable from the test data. We then predicted the classes for the test data with the LDA model and cross tabulated the predictions against the saved crime categories. The predicted classes partly differ from the observed classes, but the predictions agree with the observed values more often than they disagree, so the prediction is reasonably good.
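The share of correct predictions can be read off the diagonal of the cross tabulation above (the counts below are from this particular run; a different random split would give different counts):

```r
# Counts from the printed cross tabulation (rows = correct, columns = predicted)
tab <- matrix(c(12,  7,  0,  0,
                 9, 13, 13,  0,
                 0,  5, 11,  0,
                 0,  0,  1, 31), nrow = 4,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))
sum(diag(tab)) / sum(tab)   # about 0.66 of the test observations classified correctly
```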

# load the data
data("Boston")

# center and standardize variables
boston_scaled <- scale(Boston)

# change the object to data frame
boston_scaled=as.data.frame(boston_scaled)

# euclidean distance matrix
dist_eu <- dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4620  4.8240  4.9110  6.1860 14.4000
# manhattan distance matrix
dist_man <- dist(boston_scaled, method="manhattan")

# look at the summary of the distances

summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4830 12.6100 13.5500 17.7600 48.8600
# k-means clustering on the scaled data
km <- kmeans(boston_scaled, centers = 15)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

set.seed(123)


# determine the number of clusters
k_max <- 15

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
plot(1:k_max, twcss, type='b')

# k-means clustering
km <- kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

We reloaded the Boston dataset and standardized it, then calculated distances between the observations with the Euclidean and Manhattan metrics. We ran the k-means algorithm on the dataset and visualized the clusters with the pairs function, separating the clusters by color. We then investigated the optimal number of clusters using the total within-cluster sum of squares (TWCSS), plotting the number of clusters on the x-axis and the TWCSS on the y-axis. The TWCSS is trivially largest at one cluster and drops sharply when moving to two clusters, so we chose two as the number of clusters, ran the k-means algorithm again with two centers and visualized the resulting clusters again with the pairs function.
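The "largest drop in TWCSS" criterion used above can also be read off programmatically; a sketch on simulated data with two clearly separated clusters:

```r
set.seed(123)
# two Gaussian clusters centered at 0 and 4
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
twcss <- sapply(1:5, function(k) kmeans(x, k, nstart = 10)$tot.withinss)
# the biggest decrease in TWCSS happens between these two cluster counts
which.min(diff(twcss)) + 1   # 2, i.e. two clusters
```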

Super-bonus-exercise

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=train$crime)

We created a matrix product, which is a projection of the data points, and drew a 3D plot of its columns. We then added the color argument to the plot_ly() function and set it to the crime classes of the train set, which gave us a second plot in which the crime classes low, med_low, med_high and high are separated by colors; otherwise the two plots are similar. ***

Dimensionality reduction techniques

human=read.table("data/human.txt", sep = ",", header= TRUE)

str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNI      : Factor w/ 155 levels "1,123","1,228",..: 132 109 124 112 113 111 103 123 108 93 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155   8

The data includes 155 observations and 8 variables. The variables are the ratio of females to males with secondary education in each country, the ratio of female to male labour force participation, life expectancy, expected years of schooling, gross national income (GNI) per capita, maternal mortality ratio, adolescent birth rate, and the female share of parliamentary seats. Most of the variables are numerical; only GNI per capita is a factor variable.
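GNI ended up as a factor because the raw values contain comma thousand separators (e.g. "1,123"), so read.table could not parse them as numbers. A sketch of how such strings could be converted to numeric (the example values below are hypothetical, only mimicking GNI's format):

```r
# hypothetical example values in the same "1,123"-style format as GNI
gni_chr <- c("1,123", "12,345", "123")

# strip the commas, then coerce to numeric
gni_num <- as.numeric(gsub(",", "", gni_chr))
gni_num
```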

From the plot showing the distributions and correlations of the variables we see that most variables have non-normal distributions. Only the distribution of expected years of schooling looks approximately normal.

From the correlation plot we see that the largest correlations are between life expectancy and expected years of schooling, between maternal mortality ratio and adolescent birth rate, and between maternal mortality ratio and life expectancy.

summary(human)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##                                                                   
##       GNI         Mat.Mor         Ado.Birth         Parli.F     
##  1,123  :  1   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1,228  :  1   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  1,428  :  1   Median :  49.0   Median : 33.60   Median :19.30  
##  1,458  :  1   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  1,507  :  1   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  1,583  :  1   Max.   :1100.0   Max.   :204.80   Max.   :57.50  
##  (Other):149
# Access GGally
library(dplyr)
library(GGally)
library(corrplot)

# remove the GNI variable
human_ <- select(human, -GNI)

# visualize the 'human_' variables
ggpairs(human_)

# compute the correlation matrix and visualize it with corrplot
cor(human_)%>%corrplot()

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_)

# create and print out a summary of pca_human
s <- summary(pca_human)
s
## Importance of components:
##                             PC1      PC2      PC3     PC4     PC5    PC6
## Standard deviation     214.2426 26.51531 11.47912 4.14729 1.60859 0.1915
## Proportion of Variance   0.9817  0.01504  0.00282 0.00037 0.00006 0.0000
## Cumulative Proportion    0.9817  0.99676  0.99958 0.99994 1.00000 1.0000
##                           PC7
## Standard deviation     0.1599
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1) 

# print out the percentages of variance
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7 
## 98.2  1.5  0.3  0.0  0.0  0.0  0.0
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], sub="Burundi, Sierra Leone and Central African Republic stand out from others")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

# standardize the variables
human_std <- scale(human_)

# perform principal component analysis (with the SVD method)
pca_human_std <- prcomp(human_std)

# create and print out a summary of pca_human
s <- summary(pca_human_std)
s
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6
## Standard deviation     1.9567 1.1387 0.8697 0.70118 0.54003 0.46709
## Proportion of Variance 0.5469 0.1852 0.1081 0.07024 0.04166 0.03117
## Cumulative Proportion  0.5469 0.7322 0.8402 0.91048 0.95214 0.98331
##                            PC7
## Standard deviation     0.34185
## Proportion of Variance 0.01669
## Cumulative Proportion  1.00000
# rounded percentages of variance captured by each PC
pca_pr_std <- round(100*s$importance[2,], digits = 1) 

# print out the percentages of variance
pca_pr_std
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7 
## 54.7 18.5 10.8  7.0  4.2  3.1  1.7
# create object pc_lab to be used as axis labels
pc_lab_std <- paste0(names(pca_pr_std), " (", pca_pr_std, "%)")

# draw a biplot
biplot(pca_human_std, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab_std[1], ylab = pc_lab_std[2], sub="LaboFM and Parli contribute to the second principal component")

From the biplot we see that the ratio of female to male labour force participation correlates strongly with the female share of parliamentary seats, as the angle between the arrows representing them is small. Both variables contribute to the second principal component, as their arrows point in the same direction as the y-axis. The length of an arrow reflects the standard deviation of the corresponding variable. The biplot based on the scaled data looks quite different from the biplot of the non-scaled data: in the former, Rwanda stands out from the other countries, while in the latter Burundi, Sierra Leone and the Central African Republic stand out.
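The sensitivity of PCA to scaling can be illustrated with a small toy example (simulated data, not the human data): when one variable has a much larger variance than the others, it dominates the first principal component unless the variables are standardized first.

```r
set.seed(1)
# two independent variables with very different variances
toy <- data.frame(small = rnorm(100),
                  big   = rnorm(100, sd = 1000))

# proportion of variance on PC1, without and with standardization
p_raw <- summary(prcomp(toy))$importance[2, 1]
p_std <- summary(prcomp(scale(toy)))$importance[2, 1]

p_raw  # nearly 1: 'big' alone dominates PC1
p_std  # near 0.5: both variables contribute evenly
```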

The tea data contains 300 observations and 36 variables. All variables are factors except age, which is an integer variable. From the bar plots we see, for example, that it is most common to use tea bags, to drink tea with nothing added, and not to drink it during lunch.

library(FactoMineR)

data("tea")

str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, one_of(keep_columns))

library(ggplot2)
library(dplyr)
library(tidyr)

# visualize the dataset
gather(tea_time) %>% ggplot(aes(value))+ geom_bar() + facet_wrap("key", scales = "free") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables; they will
## be dropped

From the bottom of the multiple correspondence analysis output we can see the squared correlations (eta2) between the variables and the dimensions. We can see, for example, that the correlations between the variable how and dimension one, where and dimension one, how and dimension two, and where and dimension two are quite high.

From the multiple correspondence analysis factor map we see that the categories alone, No.sugar and Not.lunch are similar, as they lie close to each other. Unpackaged, tea shop and other differ from the rest of the categories, as they lie far from them. We also see that lemon and lunch are more similar to each other than lemon and unpackaged.

# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.279   0.261   0.219   0.189   0.177   0.156
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.141   0.117   0.087   0.062
## % of var.              7.841   7.705   6.392   4.724   3.385
## Cumulative % of var.  77.794  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
## green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
## milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
## other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")

***